---
title: BM25
---

>[BM25 (Wikipedia)](https://en.wikipedia.org/wiki/Okapi_BM25) also known as the `Okapi BM25`, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
>
>`BM25Retriever` retriever uses the [`rank_bm25`](https://github.com/dorianbrown/rank_bm25) package.

```python
%pip install -qU  rank_bm25
```

```python
from langchain_community.retrievers import BM25Retriever
```

## Create New Retriever with Texts

```python
retriever = BM25Retriever.from_texts(["foo", "bar", "world", "hello", "foo bar"])
```

## Create a New Retriever with Documents

You can now create a new retriever with the documents you created.

```python
from langchain_core.documents import Document

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ]
)
```

## Use Retriever

We can now use the retriever!

```python
result = retriever.invoke("foo")
```

```python
result
```

```output
[Document(metadata={}, page_content='foo'),
 Document(metadata={}, page_content='foo bar'),
 Document(metadata={}, page_content='hello'),
 Document(metadata={}, page_content='world')]
```

## Preprocessing Function

Pass a custom preprocessing function to the retriever to improve search results. Tokenizing text at the word level can enhance retrieval, especially when using vector stores like Chroma, Pinecone, or Faiss for chunked documents.

```python
import nltk

nltk.download("punkt_tab")
```

```python
from nltk.tokenize import word_tokenize

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ],
    k=2,
    preprocess_func=word_tokenize,
)

result = retriever.invoke("bar")
result
```

```output
[Document(metadata={}, page_content='bar'),
 Document(metadata={}, page_content='foo bar')]
```
